{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ElXOa7R7g37i"
   },
   "source": [
    "#  Putting Multitask Learning to Work\n",
    "\n",
    "This notebook walks through the creation of multitask models on MUV [1]. The goal is to demonstrate how multitask methods can provide improved performance in situations with little or very unbalanced data.\n",
    "\n",
    "## Colab\n",
    "\n",
    "This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Putting_Multitask_Learning_to_Work.ipynb)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 188
    },
    "colab_type": "code",
    "id": "3HHM8X9t_NPp",
    "outputId": "1da9ace2-4f46-4e1e-93cf-97eae4ef8bb5"
   },
   "outputs": [],
   "source": [
    "!pip install --pre deepchem\n",
    "import deepchem\n",
    "deepchem.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9Ow2nQtZg37p"
   },
   "source": [
    "The MUV dataset is a challenging benchmark in molecular design that consists of 17 different \"targets\" where there are only a few \"active\" compounds per target. There are 93,087 compounds in total, yet no task has more than 30 active compounds, and many have even less. Training a model with such a small number of positive examples is very challenging.  Multitask models address this by training a single model that predicts all the different targets at once. If a feature is useful for predicting one task, it often is useful for predicting several other tasks as well. Each added task makes it easier to learn important features, which improves performance on other tasks [2].\n",
    "\n",
    "To get started, let's load the MUV dataset.  The MoleculeNet loader function automatically splits it into training, validation, and test sets.  Because there are so few positive examples, we use stratified splitting to ensure the test set has enough of them to evaluate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 85
    },
    "colab_type": "code",
    "id": "FGi-ZEfSg37q",
    "outputId": "c806cf75-0666-4d5d-a8cd-8f5470286017"
   },
   "outputs": [],
   "source": [
    "import deepchem as dc\n",
    "import numpy as np\n",
    "\n",
    "tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')\n",
    "train_dataset, valid_dataset, test_dataset = datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6nRCpb08g375"
   },
   "source": [
    "Now let's train a model on it.  We'll use a MultitaskClassifier, which is a simple stack of fully connected layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "BvfbTbsEg376"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0004961589723825455"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n_tasks = len(tasks)\n",
    "n_features = train_dataset.get_data_shape()[0]\n",
    "model = dc.models.MultitaskClassifier(n_tasks, n_features)\n",
    "model.fit(train_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how well it does on the test set.  We loop over the 17 tasks and compute the ROC AUC for each one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MUV-466 0.9207684040838259\n",
      "MUV-548 0.7480655561526062\n",
      "MUV-600 0.9927995701235895\n",
      "MUV-644 0.9974207415368082\n",
      "MUV-652 0.7823481998925309\n",
      "MUV-689 0.6636843990686011\n",
      "MUV-692 0.6319093677234462\n",
      "MUV-712 0.7787838079885365\n",
      "MUV-713 0.7910711087229088\n",
      "MUV-733 0.4401307540748701\n",
      "MUV-737 0.34679383843811573\n",
      "MUV-810 0.9564571019165323\n",
      "MUV-832 0.9991044241447251\n",
      "MUV-846 0.7519881783987103\n",
      "MUV-852 0.8516747268493642\n",
      "MUV-858 0.5906591438294824\n",
      "MUV-859 0.5962954008166774\n"
     ]
    }
   ],
   "source": [
    "y_true = test_dataset.y\n",
    "y_pred = model.predict(test_dataset)\n",
    "metric = dc.metrics.roc_auc_score\n",
    "for i in range(n_tasks):\n",
    "    score = metric(dc.metrics.to_one_hot(y_true[:,i]), y_pred[:,i])\n",
    "    print(tasks[i], score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not bad!  Recall that random guessing would produce a ROC AUC score of 0.5, and a perfect predictor would score 1.0.  Most of the tasks did much better than random guessing, and many of them are above 0.9.\n",
    "\n",
    "# Congratulations! Time to join the Community!\n",
    "\n",
    "Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:\n",
    "\n",
    "## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)\n",
    "This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.\n",
    "\n",
    "## Join the DeepChem Gitter\n",
    "The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!\n",
    "\n",
    "# Bibliography\n",
    "\n",
    "[1] https://pubs.acs.org/doi/10.1021/ci8002649\n",
    "\n",
    "[2] https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00146"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "05_Putting_Multitask_Learning_to_Work.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
